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Abstract 

Most recent research in trainable part of speech 
taggers has explored stochastic tagging. While 
these taggers obtain high accuracy, linguistic in- 
formation is captured indirectly, typically in tens 
of thous ands of lex ical and contextual probabili- 
ties. In (Brill 1992), a trainable rule-based tagger 
was described that obtained performance compa- 
rable to that of stochastic taggers, but captured 
relevant linguistic information in a small number 



ot simple non-stochastic rules. In this paper, we 
deHcTlbi! a hlimbeT Of tixLUllHlOllH lO Lilly I'Llle-baHed 

tagger. First, we describe a method for expressing 
lexical relations in tagging that stochastic taggers 
are currently unable to express. Next, we show 
a rule-based approach to tagging unknown words. 
Finally, we show how the tagger can be extended 
into a k-best tagger, where multiple tags can be 
assigned to words in some cases of uncertainty. 



Introduction 

When aut omated part of speech tagging was in itially 
explored (Klein & Simmons 1963; Harris 1962), peo- 
ple manually engineered rules for tagging, sometimes 
with the aid of a corpus. As large corpora be- 
came available, it became clear that simple Markov- 
model based stochastic taggers that were automatic ally 
trained cou ld achieve high rates of tagging accuracy ( [Ie-| 
linck 1985). Markov-model based taggers assign a sen- 
tence the tag sequence that maximizes Pro6(word|tag)* 
Profe(tag|previous n tags). These probabilities can be 
estimated directly from a manually tagged corpus.^] 

"This paper appears in the proceedings of the Twelfth 
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Stochastic taggers have a number of advantages over 
the manually built taggers, including obviating the need 
for laborious manual rule construction, and possibly 
capturing useful information that may not have been 
noticed by the human engineer. However, stochas- 
tic taggers have the disadvantage that linguistic infor- 
mation is only captured indirectly, in large tables of 
statistics. Almost all recent work in developing au- 
tomatically trained part of speech taggers has b een 
on further exploring Markov-model based tagging ( Je- 
linck 1985|; IChurch 1988|; pcrosc 1988|; |L)cMarckcn 199(1 ; 



Vlerialdo 1991|; jCutting et al. 1992|; [Kupicc 199; ; 



CJharniak et ai. 1993 



In (Brill 1992), a trainable rule-based tagger is de- 
scribed that achieves performance comparable to that 
of stochastic taggers. Training this tagger is fully au- 
tomated, but unlike trainable stochastic taggers, lin- 
guistic information is encoded directly in a set of sim- 
ple non-stochastic rules. In this paper, we describe 
some extensions to this rule-based tagger. These in- 
clude a rule-based approach to: lexicalizing the tag- 
ger, tagging unknown words, and assigning the k-best 
tags to a word. All of these extensions, as well as the 
original tagger, are based upon a learning paradigm 
called transformation-based error-driven learning. This 
learning paradigm has shown promise in a number of 
other areas of natural language processing, and we hope 
that the extensions to transformation-based learning 
described in this paper can carry over to other domains 
of application as wcll.^] 

Transformation-Based Error-Driven 
Learning 

Transformation-based error-driven learning has been 
applied to a number of natural language problems, in- 
cluding part of speech tagging, prepositional phras e at 



tachment disambi g uation, and syntactic parsing (Brill 



telli aence (www.aaai.org). All rights reserved 



1992| |Brill 1993a| ; |Brill 1993b| ). A simila r approach 



^ne can also estimate these probabilities without a is being exp lored for machine translation ( Su, Wu, fc 



man rally tagged corpus, using a hidden Markov model. 
However, it appears to be the case that directly estimat- 
ing probabilities from even a very small manually tagged 
corpus gives better results than trainin g a hidden Ma rkov 
model on a large untagged corpus (see (Merialdo 1991)). 



Chang 1992). Figure [j] illustrates the learning pro 



cess. First, unannotated text is passed through the 



2 The programs described in this paper can be obtained 
by contacting the author. 
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Figure 1: Transformation-Based Error-Driven Learn- 
ing. 



initial-state annotator. The initial-state annotator can 
range in complexity from assigning random structure to 
assigning the output of a sophisticated manually cre- 
ated annotator. Once text has been passed through 
the initial-state annotator, it is then compared to the 
and transformations are learned that can be ap- 
plied to the output of the initial state annotator to make 
it better resemble the truth. 

In all of the applications described in this paper, 
the following greedy search is applied: at each iter- 
ation of learning, the transformation is found whose 
application results in the highest score; that transfor- 
mation is then added to the ordered transformation 
list and the training corpus is updated by applying the 
learned transformation. To define a specific application 
of transformation-based learning, one must specify the 
following: (1) the initial state annotator, (2) the space 
of transformations the learner is allowed to examine, 
and (3) the scoring function for comparing the corpus 
to the truth and choosing a transformation. 

Once an ordered list of transformations is learned, 
new text can be annotated by first applying the ini- 
tial state annotator to it and then applying each of the 
learned transformations, in order. 

An Earlier Tranformation-Based Tagger 



The original transformation-based tagger (Brill 1992) 
works as follows. The initial state annotator assigns 
each word its most likely tag as indicated in the train- 



ing corpus. The most likely tag for unknown words is 
guessed based on a number of features, such as whether 
the word is capitalized, and what the last three letters of 
the word are. The allowable transformation templates 
are: 

Change tag a to tag b when: 

1. The preceding (following) word is tagged z. 

2. The word two before (after) is tagged z. 

3. One of the two preceding (following) words is tagged 
z. 

4. One of the three preceding (following) words is tagged 
z. 

5. The preceding word is tagged z and the following 
word is tagged w. 

6. The preceding (following) word is tagged z and the 
word two before (after) is tagged w. 

where a, b, z and w are variables over the set of parts 
of speech. To learn a transformation, the learner in 
essence applies every possible transformation,^ counts 
the number of tagging errors after that transformation 
is applied, and chooses that transformation resulting in 
the greatest error reduction. H Learning stops when no 
transformations can be foundwhose application reduces 
errors beyond some prespecified threshold. An example 
of a transformation that was learned is: change the 
tagging of a word from noun to verb if the previous 
word is tagged as a modal. Once the system is trained, 
a new sentence is tagged by applying the initial state 
annotator and then applying each transformation, in 
turn, to the sentence. 

Lexicalizing the Tagger 

No relationships between words are directly captured in 
stochastic taggers. In the Markov model, state transi- 
tion probabilities [P{Tagi\Tagi-i . . . Tagi- n )) express 
the likelihood of a tag immediately following n other 
tags, and emit probabilities (P(Wordj \Tagi)) express 
the likelihood of a word given a tag. Many useful rela- 
tionships, such as that between a word and the previous 
word, or between a tag and the following word, are not 
directly captured by Markov-model based taggers. The 
same is true of the earlier transformation-based tagger, 
where transformation templates did not make reference 
to words. 

To remedy this problem, the transformation-based 
tagger was extended by adding contextual transforma- 
tions that could make reference to words as well as part 
of speech tags. The transformation templates that were 
added are: 

Change tag a to tag b when: 



As specified in a manually annotated corpus. 



4 A11 possible instantiations of transformation templates. 
J The search is data-driven, so only a very small percent- 
age of possible transformations really need be examined. 



1. The preceding (following) word is w. 

2. The word two before (after) is w. 

3. One of the two preceding (following) words is w. 

4. The current word is w and the preceding (following) 
word is x. 

5. The current word is w and the preceding (following) 
word is tagged z. 

where w and x are variables over all words in the train- 
ing corpus, and z is a variable over all parts of speech. 

Below we list two lexicalized transformations that 
were learned^] 

Change the tag: 

(12) From preposition to adverb if the word two po- 
sitions to the right is as. 

(16) From non-3rd person singular present verb 
to base form verb if one of the previous two words is 
n't|| 



The Penn Treebank tagging style manual specifies 
that in the collocation as . . . as, the first as is tagged 
as an adverb and the second is tagged as a preposition. 
Since as is most frequently tagged as a preposition in 
the training corpus, the initial state tagger will mistag 
the phrase as tall as as: 

as/preposition tall/adjective as/preposition 

The first lexicalized transformation corrects this 
mistagging. Note that a stochastic tagger trained on 
our training set would not correctly tag the first oc- 
currence of as. Although adverbs are more likely than 
prepositions to follow some verb form tags, the fact that 
P(as\preposition) is much greater than P(as\adverb), 
and P(adjective\preposition) is much greater than 
P(adj ective\adverb) lead to as being incorrectly tagged 
as a preposition by a stochastic tagger. A trigram tag- 
ger will correctly tag this collocation in some instances, 
due to the fact that P(preposition\&dverh adjective) is 
greater than P(preposition\preposition adjective), but 
the outcome will be highly dependent upon the context 
in which this collocation appears. 

The second transformation arises from the fact that 
when a verb appears in a context such as We do n't 

or We did n't usually , the verb is in base form. 

A stochastic trigram tagger would have to capture this 
linguistic information indirectly from frequency counts 
of all trigrams of the form:[] 



°A11 experiments were run on the Pen n Treebank tagged 



Wall Street Journal co rpus, version 0.5 (Marcus, Santorini 



Marcinkiewicz 1993) 



'In the Penn Treebank, n't is treated as a separate token, 
so don't becomes do/VB-NON3rd-SING n't/ 'ADVERB. 
8 Where a star can match any part of speech tag. 
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Table 1: Comparison of Tagging Accuracy With No 
Unknown Words 
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ADVERB 



ADVERB 
ADVERB 
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BASE.VERB 
PRESENT.VERB 
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and fr om the fact that P(n't\A DV ERB) is fairly high. 

In (Weischedel et al. 1993), results are given when 
training and testing a Markov-model based tagger on 
the Penn Treebank Tagged Wall Street Journal Cor- 
pus. They cite results making the closed vocabulary 
assumption that all possible tags for all words in the 
test set are known. When training contextual proba- 
bilities on 1 million words, an accuracy of 96.7% was 
achieved. Accuracy dropped to 96.3% when contextual 
probabilities were trained on 64,000 words. We trained 
the transformation-based tagger on 600,000 words from 
the same corpus, making the same closed vocabulary as- 
sumption,^ and achieved an accuracy of 97.2% on a sep- 
arate 150,000 word test set. The transformation-based 
learner achieved better performance, despite the fact 
that contextual information was captured in only 267 
simple nonstochastic rules, as opposed to 10,000 con- 
textual probabilities that were learned by the stochas- 
tic tagger. To see whether lexicalized transformations 
were contributing to the accuracy rate, we ran the ex- 
act same test using the tagger trained using the ear- 
lier transformation template set, which contained no 
transformations making reference to words. Accuracy 
of that tagger was 96.9%. Disallowing lexicalized trans- 
formations resulted in an 11% increase in the error rate. 
These results are summarized in table [l| 

When transformations are allowed to make reference 
to words and word pairs, some relevant information is 



9 In both ( Weischedel et al. 199; ) and here, the test set 
was incorporated into the lexicon, but was not used in learn- 
ing contextual information. Testing with no unknown words 
might seem like an unrealistic test. We have done so for 
three reasons (We show results when unknown words are 
included later in the paper): (1) to allow for a comparison 
with previously quoted results, (2) to isolate known word 
accuracy from unknown word accuracy, and (3) in some sys- 
tems, such as a closed vocabulary speech recognition system, 
the assumption that all words are known is valid. 



probably missed due to sparse data. We are currently 
exploring the possibility of incorporating word classes 
into the rule-based learner in hopes of overcoming this 
problem. The idea is quite simple. Give n a source o f 
word class information, such as WordNet ( |Miller 199C ) , 
the learner is extended such that a rule is allowed to 
make reference to parts of speech, words, and word 
classes, allowing for rules such as Change the tag from 
X to Y if the following word belongs to word class Z. 
This approach has already been successfully applied to 



a system for prepositional phrase disambiguation ( Brill 
199@. 

Unknown Words 

In addition to not being lexicalized, another problem 
with the original transformation-based tagger was its 
relatively low accuracy at tagging unknown words 
In the initial state annotator for tagging, words are as- 
signed their most likely tag, estimated from a training 
corpus. In the original formulation of the rule-based 
tagger, a rather ad- hoc algorithm was used to guess the 
most likely tag for words not appearing in the training 
corpus. To try to improve upon unknown word tag- 
ging accuracy, we built a transformation-based learner 
to learn rules for more accurately guessing the most 
likely tag for words not seen in the training corpus. If 
the most likely tag for unknown words can be assigned 
with high accuracy, then the contextual rules can be 
used to improve accuracy, as described above. 

In the transformation-based unknown-word tagger, 
the initial state annotator naively labels the most likely 
tag for unknown words as proper noun if capitalized 
and common noun otherwise]^] 

Below we list the set of allowable transformations: 

Change the tag of an unknown word (from X) to Y if: 

1. Deleting the prefix x, \x\ <= 4, results in a word (x 
is any string of length 1 to 4) . 

2. The first (1,2,3,4) characters of the word are x. 

3. Deleting the suffix x, \x\ <= 4, results in a word. 

4. The last (1,2,3,4) characters of the word are x. 

5. Adding the character string x as a suffix results in a 
word (\x\ <— 4). 

6. Adding the character string x as a prefix results in a 
word (\x\ <= 4). 

7. Word W ever appears immediately to the left (right) 
of the word. 

8. Character Z appears in the word. 



10 This section describes work done in part while the au- 
thor was at the University of Pennsylvania. 

n If we change the tagger to tag all unknown words as 
common nouns, then a number of rules are learned of the 
form: change tag to proper noun if the prefix is "E", 
since the learner is not provided with the concept of upper 
case in its set of transformation templates. 



An unannotated text can be used to check the con- 
ditions in all of the above transformation templates. 
Annotated text is necessary in training to measure the 
effect of transformations on tagging accuracy. Below 
are the first 10 transformation learned for tagging un- 
known words in the Wall Street Journal corpus: 

Change tag: 

1. From common noun to plural common noun if 

the word has suffix -s^| 

2. From common noun to number if the word has 
character . 

3. From common noun to adjective if the word has 
character - 

4. From common noun to past participle verb if 

the word has suffix -ed 

5. From common noun to gerund or present par- 
ticiple verb if the word has suffix -ing 

6. To adjective if adding the suffix -ly results in a word 

7. To adverb if the word has suffix -ly 

8. From common noun to number if the word $ ever 
appears immediately to the left 

9. From common noun to adjective if the word has 
suffix -al 

10. From noun to base form verb if the word would 
ever appears immediately to the left. 

Keep in mind that no specific affixes are prespecified. 
A transformation can make reference to any string of 
characters up to a bounded length. So while the first 
rule specifies the English suffix " s" , the rule learner also 
considered such nonsensical rules as: change a tag to 
adjective if the word has suffix "xhqr". Also, absolutely 
no English -sp ecific information need be prespecified in 
the learner 

We then ran the following experiment using 1.1 mil- 
lion words of the Penn Treebank Tagged Wall Street 
Journal Corpus. The first 950,000 words were used for 
training and the next 150,000 words were used for test- 
ing. Annotations of the test corpus were not used in 
any way to train the system. From the 950,000 word 
training corpus, 350,000 words were used to learn rules 
for tagging unknown words, and 600,000 words were 
used to learn contextual rules. 148 rules were learned 
for tagging unknown words, and 267 contextual tag- 
ging rules were learned. Unknown word accuracy on 
the test corpus was 85.0%, and overall tagging accu- 
racy on the test corpus was 96.5%. To our knowledge, 
this is the highest overall tagging accuracy ever quoted 



12 Note that this transformation will result in the mistag- 
ging of actress. The 17th learned rule fixes this probiem. 
This ruie states: change a tag from plural common noun 
to singular common noun if the word has suffix ss. 

13 This le arner has a lso been applied to tagging Old En- 
glish. See ( |Brill 1993b[ ). 



on the Penn Treebank Corpus when making the open 
vocab ulary assumption. 

In ( Weischedel et al. 1993| ), a statistical approach to 
tagging unknown words is shown. In this approach, a 
number of suffixes and important features are prespec- 
ified. Then, for unknown words: 

p(VK|T) = ^(unknown word|T) * 

p(Capitalize-feature|T) * p{suf fixes, hyphenation\T) 

Using this equation for unknown word emit probabil- 
ities within the stochastic tagger, an accuracy of 85% 
was obtained on the Wall Street Journal corpus. This 
portion of the stochastic model has over 1,000 parame- 
ters, with 10 8 possible unique emit probabilities, as op- 
posed to only 148 simple rules that are learned and used 
in the rule-based approach. We have obtained compa- 
rable performance on unknown words, while capturing 
the information in a much more concise and perspicu- 
ous manner, and without prespecifying any language- 
specific or corpus-specific information. 

K-Best Tags 

There are certain circumstances where one is will- 
ing to relax the one tag per word requirement in or- 
der to increase the probability tha t the correct tag 
will be assigned to ea ch word. In (DeMarcken 1990; 
Weischedel et al. 1993), k-best tags are assigned within 
a stochastic tagger by returning all tags within some 
threshold of probability of being correct for a particu- 
lar word. 

We can modify the transformation-based tagger to re- 
turn multiple tags for a word by making a simple mod- 
ification to the contextual transformations described 
above. The initial-state annotator is the tagging out- 
put of the transformation-based tagger described above. 
The allowable transformation templates are the same as 
the contextual transformation templates listed above, 
but with the action change tag X to tag Y modified to 
add tag X to tag Y or add tag X to word W. Instead 
of changing the tagging of a word, transformations now 
add alternative taggings to a word. 

When allowing more than one tag per word, there is 
a trade-off between accuracy and the average number of 
tags for each word. Ideally, we would like to achieve as 
large an increase in accuracy with as few extra tags as 
possible. Therefore, in training we find transformations 
that maximize precisely this function. 

In table ^| we present results from first using the one- 
tag-per-word transformation-based tagger described in 
the previous section and then applying the k-best tag 
transformations. These transformations were learned 
from a separate 240,000 word corpus. As a baseline, 
we did k-best tagging of a test corpus as follows. Each 
known word in the test corpus was tagged with all tags 
seen with that word in the training corpus and the five 
most likely unknown word tags were assigned to all 
words not seen in the training corpus This resulted 

14 Thanks to Fred Jelinek and Fernando Pereira for sug- 
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Table 2: Results from k-best tagging. 



in an accuracy of 99.0%, with an average of 2.28 tags 
per word. The rule-based tagger obtained the same ac- 
curacy with 1.43 tags per word, one third the number 
of additional tags as the baseline tagger.]^ 

Conclusions 

In this paper, we have described a number of extensions 
to previous work in rule-based part of speech tagging, 
including the ability to make use of lexical relationships 
previously unused in tagging, a new method for tagging 
unknown words, and a way to increase accuracy by re- 
turning more than one tag per word in some instances. 
We have demonstrated that the rule-based approach ob- 
tains competitive performance with stochastic taggers 
on tagging both unknown and known words. The rule- 
based tagger captures linguistic information in a small 
number of simple non-stochastic rules, as opposed to 
large numbers of lexical and contextual probabilities. 
Recently, we have begun to explore the possibility of 
extending these techniques to other problems, including 
learning pronunciation networks for speech recognition 
and learning mappings between sentences and semantic 
representations. 
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